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The Yellowstone geothermal complex contains over 10,000 diverse geothermal features 
that host numerous phylogenetically deeply rooted and poorly understood archaea, bac- 
teria, and viruses. Microbial communities in high-temperature environments are generally 
less diverse than soil, marine, sediment, or lake habitats and therefore offer a tremen- 
dous opportunity for studying the structure and function of different model microbial 
communities using environmental metagenomics. One of the broader goals of this study 
was to establish linkages among microbial distribution, metabolic potential, and envi- 
ronmental variables. Twenty geochemically distinct geothermal ecosystems representing 
a broad spectrum of Yellowstone hot-spring environments were used for metagenomic 
and geochemical analysis and included approximately equal numbers of: (1) phototrophic 
mats, (2) "filamentous streamer" communities, and (3) archaeal-dominated sediments. 
The metagenomes were analyzed using a suite of complementary and integrative bioin- 
formatic tools, including phylogenetic and functional analysis of both individual sequence 
reads and assemblies of predominant phylotypes.This volume identifies major environmen- 
tal determinants of a large number of thermophilic microbial lineages, many of which have 
not been fully described in the literature nor previously cultivated to enable functional and 
genomic analyses. Moreover, protein family abundance comparisons and in-depth analyses 
of specific genes and metabolic pathways relevant to these hot-spring environments reveal 
hallmark signatures of metabolic capabilities that parallel the distribution of phylotypes 
across specific types of geochemical environments. 
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INTRODUCTION 

The Yellowstone hotspot is responsible for an enormous number 
( > 14,000) and diversity of thermal features that cover a wide range 
in pH (2-10), temperature (40-92°C), and geochemical properties 
(Fournier, 1989; Rye and Truesdell, 2007). The waters, rocks, and 
mineral surfaces in these geothermal sites provide an assortment 
of electron donors such as hydrogen, sulfide, and ferrous iron, 
as well as electron acceptors (e.g., dissolved oxygen or elemental 
S) that are vital to the survival of thermophilic microorganisms 
(Brock, 1978). Remarkably, many sites exhibit relative stability or 
regular patterns of temporal change, despite the dynamic nature of 
geothermal activity (Ball et al., 1998, 2002, 2010; McCleskey et al., 
2005). Microbial communities in high-temperature environments 
are often dominated by just several types of microorganisms (phy- 
lotypes), and in general are significantly less diverse than lower- 
temperature habitats. Geothermal systems thus serve as valuable 
models for establishing linkages between genomic potential and 
environmental parameters, and for understanding environmental 



controls on community structure and function. Moreover, many 
current thermal environments in YNP harbor modern-day analogs 
of microbial communities potentially important in ancient Earth, 
perhaps most notably, cyanobacterial mats that may have analogs 
dating back to the beginning of oxygenic photosynthesis and the 
"Great Oxidation Event" (Canfield, 2005; Konhauser, 2006, 2009; 
Sessions et al., 2009). Consequently, the expansive field site of YNP 
provides a natural laboratory for studying microbial evolution 
and adaptation across uniquely protected and preserved thermal 
environments. 

The evolutionary history of all three branches of Life (Bac- 
teria, Archaea, and Eucarya) is intertwined with major geolog- 
ical events that have shaped the planet (Nealson and Ghiorse, 
2001; Konhauser, 2006). Relationships between geological events 
and microbial function are subjects of considerable debate and 
interest across numerous disciplines, and provide a strong basis 
for interdisciplinary collaboration. The extant global distribution 
of microorganisms reveals an astounding diversity of functional 



www.frontiersin.org 



May 2013 | Volume 4 | Article 67 1 1 



Inskeep et al. 



TheYNP metagenome project 




FIGURE 1 | Map of Yellowstone National Park (YNP) showing the 
locations of 20 geothermal sites sampled for metagenome and 
geochemical analysis (green = phototrophic sites; 
yellow/red = archaeal sites; blue = Aquificales "streamer" 
communities; gray = Obsidian Pool Prime; site numbers defined in 
Table 1). The dashed line represents the boundary of the most recent 
caldera. 



attributes, many of which are specialized to specific physical 
or chemical constraints. The importance of microbially medi- 
ated reactions in the geosciences has been discussed (Newman 
and Banfield, 2002; Reysenbach and Shock, 2002; Falkowski 
et al., 2008), and numerous metabolic pathways are directly 
involved in key global processes such as nutrient cycling (e.g., 
C, N, S, Fe), gas exchange (CO2 and CH4), and trace element 
cycling (e.g., As, Sb, Hg). However, the microbial complexity 
typical of temperate soil and aquatic habitats (e.g., Madigan 
et al., 2012) has contributed to the difficulty in understand- 
ing specific linkages between community members and mate- 
rial cycling (Tringe et al, 2005; DeLong et al., 2006; Rusch 
et al., 2007). Metagenomics of high-temperature communities 
in YNP provides an unparalled opportunity to study specific 
microbiological associations across widely different geochem- 
istry and mineralogy under highly controlled, pseudo-steady-state 
conditions. 

Metagenome sequencing of environmental DNA has provided a 
useful tool for studying microbial community structure and func- 
tion in numerous environments (e.g., Tyson et al., 2004; Tringe 
et al, 2005; Baker et al, 2006; DeLong et al, 2006; Rusch et al, 
2007; Dick et al., 2009). Prior metagenome sequencing in geot- 
hermal systems of YNP has been used to dissect phototrophic mat 
communities (Klatt et al., 20 1 1 ), and establish linkages among geo- 
chemical processes and microbial populations of chemotrophic 
communities (Inskeep et al, 2010). These studies provide specific 
examples of what can be gained from detailed phylogenetic and 
functional analysis of metagenome sequence. Here we describe, 
characterize and compare metagenomes and associated metadata 
collected across 20 different geothermal sites (Figure 1), which 
were selected to cover some of the major types of geochemical 
and microbiological systems active in YNP. The specific objec- 
tives of this manuscript were to (i) discuss rationale for the study 
design that focused on specific linkages between environmental 
parameters and microbial community structure, (ii) demonstrate 
the primary geochemical and geophysical attributes that separate 
different high-temperature microbial habitats, and (iii) provide 
an overview of metagenome sequence content and high-level 
(TIGRFAM) functional analyses that serve to introduce subse- 
quent studies focused on three main geobiological ecosystem 
types: (1) phototrophic mats; (2) Aquificales-rich "filamentous- 
streamer" communities; and (3) archaeal-dominated sediments. 
These studies are the outgrowth of one of the major collaborative 
activities of an NSF Research Coordination Network, established 
to promote and coordinate research focused on geothermal biol- 
ogy and geochemistry in YNP. Importantly, the research presented 
here represents the first comprehensive and coordinated non- 
PCR based survey of microorganisms distributed across a broad 
spectrum of high-temperature habitats in YNP. 

RESULTS AND DISCUSSION 
GEOCHEMISTRY AND ENVIRONMENTAL CONTEXT 

A geo-referenced geothermal database developed by the Yellow- 
stone Center for Resources (U.S. National Park Service) contains 
information for numerous geothermal features including pH, tem- 
perature, electrical conductivity (EC), and site context (with image 
cataloging). A version of this database (~8000 entries) is available 



at the RCN YNP website , as well as other data on geothermal 
sites collected by either the U.S. Geological Survey (e.g., Ball et al., 
1998, 2002, 2010; McCleskey et al, 2005) or individual researchers. 
The distribution of pH values across geothermal features in YNP 
is bimodal with peaks near pH 2.5 and 6.5 (Figure 2, based on 
n ~ 7700). While sites chosen for this study (colored bars along x- 
axis) strategically cover the bimodal pH ranges observed in YNP 
(Figure 2), the microbial communities inhabiting this wide range 
in pH are also strongly influenced by temperature, oxygen, sulfide, 
and other physical variables. EC values for a similar number of sites 
(«~6450) show that most geothermal systems in YNP exhibit 
EC values ranging from 0.1-3 dS/m (or mmho/cm) (Figure 2), 
which corresponds to ionic strengths (I) of ~0. 005-0.04 M. Com- 
pared to marine (I ~ 0.7 M), and other saline environments, the 
geothermal habitats sampled herein have significantly lower levels 
of dissolved ions, which would not be expected to impart sig- 
nificant environmental selection toward halophilic (high salt) or 
alkaliphilic (high-pH) microorganisms. 

The current study was designed to cover a broad range of geo- 
chemical and temperature conditions, and to represent several of 
the major types of geothermal systems in YNP (Figure SI in Sup- 
plementary Material) . The breadth of habitats sampled capitalizes 
on a range of several key "system -defining" variables that give rise 
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FIGURE 2 | Distribution of pH (A) and electrical conductivity (EC) (B) 
values across 7700 and 6450 geothermal features, respectively, 
cataloged in the Yellowstone Center for Resources (YCR) database 
(available on line via www.rcn.montana.edu). Vertical bars along pH and 
EC axes indicate values of the 20 sites described in the current study 
(Aquificales "streamer" communities = blue; archaeal-dominated 
sites = red; phototrophic mats = green). Ionic strengths ranging from 
~0. 005-0. 04 M are indicated with dashed lines (WS_18 has the highest EC 
in the study of ~3.2 mmho/cm). 



to different niches for thermophilic microorganisms. These key 
environmental variables and resulting habitat types are shown as a 
decision-tree (Figure 3), which is a descriptive tool to visualize the 
study design (showing replication of similar sites when possible). 
The nodes in the decision-tree also serve to represent hypotheses 
of the primary environmental factors that are most important for 
defining niches occupied by specific phylotypes. The correspond- 
ing metadata (geochemistry and physical context) associated with 
each of these sites are provided as supplemental material (Table 
S2 in Supplementary Material), but key attributes (T, pH, sulfide, 
or elemental S, and physical context) of each site are included 
here to understand the predominant site groupings, as well as 
the targeted communities and/or populations the study aimed to 
elucidate (Table 1; Figure 3). 



The separation of sites as a function of pH represents a key 
decision point in the study design. As the first node, pH allows 
for an important distinction between the effective temperature 
limit for colonization by known phototrophs in acidic (pH < 5) 
versus more neutral (pH 5-9) environments. It is well established 
that the upper temperature limit for oxygenic phototrophs is ~70- 
74°C (Brock, 1978; Madigan et al, 2012), but cyanobacteria do not 
generally colonize habitats below pH 4.5-5. Consequently, at pH 
values < 5, a different phototrophic limit is established near 54- 
56°C, corresponding to the upper temperature limit of members 
of the Cyanidales (red algae), diatoms, or possible representa- 
tives from the mildly acidophilic purple-bacteria (Toplin et al, 
2008; Madigan et al., 2012). Consequently, bacterial phototrophic 
mats are confined to the upper node of the decision-tree, and 
are not expected as dominant organisms in habitat types below 
pH 5. Diverse eukaryotic (algal) phototrophic mats exist in YNP 
(<55°C), but these were not considered in the current study. The 
phototrophic sites (5, 6, 7, 15, 16, and 20) were chosen to target 
communities dominated by either oxygenic phototrophs (OP), 
filamentous anoxygenic phototrophs (FAPs), or anoxygenic pho- 
totrophs (AP). As noted, the sample from Obsidian Pool Prime 
(OPP_17) was obtained from biofilm growth on a large glass 
plate and represents a different physical context relative to other 
phototrophic sites. 

Both low- and high-pH sites are further separated based on 
the presence or absence of sulfide and/or elemental S (Figure 3). 
This is an important variable because high sulfide concentrations 
also imply low oxygen (hypoxia). When sulfide levels are above 
detection (or if significant elemental sulfur and/or thiosulfate are 
present), there is generally rapid abiotic consumption of oxygen 
by reduced S species (Nordstrom et al., 2005, 2009). This does not 
rule out the possibility that O2 influx from the atmosphere will 
influence the resulting microbial community, but systems with 
plentiful sulfide and/or elemental S will (i) contain little to no 
detectable dissolved oxygen, and (ii) contain species of S that can 
serve as potential electron donors or acceptors. Utilization of an 
O2 variable in the decision-tree does not offer the same utility; low 
O2 does not imply the presence of reduced S species. Some geot- 
hermal springs discharge non-sulfidic, hypoxic waters that contain 
other reduced constituents such as ferrous Fe (e.g., CP_7, OSP_8). 

Two additional variables related to physical context are nec- 
essary to explain the study design and the corresponding habitat 
types included for metagenome analysis. Although these variables 
are included here as a simple yes or no, they represent important 
physical determinants of microbial community structure. Gradi- 
ents in oxygen, light quantity, and quality, and sulfide are sub- 
stantial as a function of mat depth and influence the distribution 
of microorganisms (Ward et al, 1998; Ramsing et al., 2000; van 
der Meer et al., 2005). For instance, sub-surface mat (SSMAT) 
layers are known to be less oxygenated and generally contain a 
greater abundance of FAPs relative to surface layers, which show a 
greater abundance of cyanobacteria (Ward et al., 2006; Klatt et al, 
2011). Two of the phototrophic sites sampled in this study were 
from sub-surface mat positions at Mushroom Spring (MS_15) 
and Fairy Geyser (FG_16), where mat dissection was conducted 
to focus on novel FAP communities discovered in prior stud- 
ies (Ward et al., 1998; Boomer et al, 2002), and because other 
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FIGURE 3 | The thermophilic microbial communities included in the YNP 
Metagenome Project were separated into predominant habitat types 
using a decision-tree with four "nodes" or primary environmental 
factors: pH, temperature ( C), the presence or absence of dissolved 
sulfide and/or elemental S, and additional physiographic context (4a: 
SSMat = sub-surface sample obtained from laminated phototrophic 
mats; 4b: Flow = primary flow channel). Decision nodes are represented by 
brown squares, node output by blue rectangles, and decision endpoints by 
triangles (i.e., predominant habitat types, site numbers). Targeted habitat 



types: FAP = filamentous anoxygenic phototrophs, OP = oxygenic 
phototrophs, AP = anoxygenic phototrophs, AQ-Therm/Sulf/Hydro = 
Aquificales "streamer" communities dominated by one of three major 
Aquificales lineages (i.e., Thermocrinis, Sulfurihydrogenibium, 
Hydrogenobaculum); S-SED = bottom or suspended sulfur-rich sediments 
dominated by archaea; Fe-Oxide = Fe-oxyhydroxide microbial mat dominated 
by archaea. The bulk aqueous characteristics of OPP_17 place this sample 
closest to sites WC_6 and CP_7 in the decision-tree, however, this sample 
was collected from biofilm growth on a large glass plate (see text, Table 1). 



metagenome investigations were focused on the top layers of the 
MS_15 phototrophic mats (Klatt et al, 2011; Liu et al, 2011). 
Other phototrophic samples (CP, WC, BLVA) were obtained from 
surface mats that occur within geothermal outflow channels, and 
thus flow is an implied physical variable that defines these habitats. 

Hydrodynamic conditions (i.e., flow rate, turbulence) are cru- 
cial for defining the microenvironmental context of microor- 
ganisms in geothermal settings. All Aquificales "streamer" 
communities (DS_9, MHS_10, CS_12, OSP_14, OS_ll, BCH_13) 
were obtained within the primary flow path of high-velocity (0.1- 
0.3 ms~') outflow channels, as well as site OSP_8 (Fe-oxide mat), 
which is immediately down-gradient of OSP_14. Members of 
the deeply rooted bacterial Aquificales are known for assuming 
filamentous morphology in turbulent environments where the 
discharge of hot reduced waters occurs in the presence of air or 
oxygenated waters (e.g., marine hydrothermal vents, Reysenbach 
et al, 2000). Flow and turbulence of discharged geothermal waters 
encourages the degassing of H2S(aq) as well as the ingassing of 
02(g) (Inskeep et al, 2005; Nordstrom et al., 2005). Consequently, 
the high-velocity, stream-channel habitats are considerably dif- 
ferent from geothermal pools, which accumulate sediments of 
various sulfides, S°, alunite, kaolinite, and polymorphs of Si02, 
depending on specific geochemical conditions such as pH, and 
levels of H 2 S, As, Sb, and Fe (Inskeep et al, 2009). Sulfidic 
and elemental S-rich sediments (CH_1, NL_2, MG_3, CIS_19, 



JCHS_4, WS_18) were sampled from low-flow habitats over a 
wide pH range (2.5-6.1) at temperatures considerably greater 
than the phototrophic limit (Figure 3). Sulfidic sites less than 
pH 5-6 can be further separated based on pH, an important 
variable influencing the distribution of specific crenarchaea (e.g., 
organisms within the order Sulfolobales increase in abundance 
with decreasing pH). 

The oxidation of aqueous Fe(II) to solid-phase ferric oxyhy- 
droxide is an exergonic reaction under most hydrothermal condi- 
tions (Amend and Shock, 2001; Amend et al, 2003; Inskeep et al., 
2005), but of course requires dissolved O2. The in-channel flow 
environments promote equilibration with atmospheric conditions 
and encourage O2 influx. The acidic Fe-oxide mat (0.5-1 cm 
depth) sampled in Norris Geyser Basin (OSP_8) occurs within 
a high-velocity channel where oxygenation results in 02(aq) 
concentrations of ~ 40-60 |xM, which are 20-30% of theoreti- 
cal saturation at this temperature and pressure. The concentra- 
tion of Fe(II) does not appear in the decision-tree, because pH 
and the inference of oxygen (e.g., lack of sulfide) are already 
included as major determinants (concentrations of total Fe gen- 
erally increase three orders of magnitude per pH unit decrease). 
Consequently, it is not the concentration of Fe per se that deter- 
mines whether Fe(II) is oxidized, but whether the hydrodynamic 
or physical setting promotes oxygenation. For example, the more 
acidic CH_1 site contains ~5 times more soluble Fe(II) than 
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Table 1 | Site names, abbreviations, and minimal set of environmental metadata necessary to separate different microbial community types 
(temperature, pH, dissolved sulfide or the presence of elemental S; Figure 3). 



Site 


Site 

Abbrev. 


T( C) 


PH 


DS 
(uM) 


S 


Physical 
Context 


Target Phylum/Other 


White Creek 


WC 6 


48-50 


8.2 


<1 


N 


Flow channel, 
thermal creek 


Cyanobacteria (Mastigocladus sp.), 
Chloroflexi 


Chocolate Pots 


CP 7 


52 


6.2 


<1 


N 


Flow channel, 
spring source 


Cyanobacteria, Chloroflexi, 
Rhodococcales 


iviusnroom opnny 


IVIO 1 0 


rh 


O.tL 
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M 
In 


Flow channel, 
suusurrace mat 


Chloroflexi (Red), 

L/yanoDacieria [synecnococcus sp.j 


rairy ueyser 


Pfi 1 R 
rO I D 


OD-OO 


Q 1 


1 

< 1 


M 
In 


Splash mat, 
suDsurrace mat 


L/MioroTiexi ^neuj 


Ddin LaKu VISIa 


DLVrt 0 


£7 
00-0/ 




117 
1 1 / 


V 
Y 


Flow channel, 
near source 


L/MioroTiexi ^oreenj 


Ddin LdKc VISIa 


Rl \/A on 


04-00 




117 
I I / 


V 
Y 


Flow channel, 
near source 


Chloroflexi (Green), Gammaproteo- 
uaciena ( / noririocnroiTisiiuiTi sp.j 


Obsidian Pool 
Prime 


wr r I / 


00 


R 7 
0. / 


^1 
< I 


M 
In 


Large Pool 
oiass suae 


Dacienai vjr uivisions 


Dragon Spring 


DS 9 


68-72 


3.1 


80 


Y 


Flow channel 


A ■■£■ 1 /III 1 1 \ 

Aquificales {Hydrogenobaculum) 


100 Spring Plain 


OSP 14 


72-74 


3.5 


10 


N 


Flow channel 


A 1 /III 1 1 \ 

Aquificales {Hydrogenobaculum) 


Mammoth Hot Spr. 


MHS 10 


70-72 


6.5 


70 


Y 


Flow channel 


Aquificales (Sulfurihydrogenibium) 


Calcite Springs 


CS 12 


74-76 


7.8 


105 


Y 


Flow channel 


A 1 / /O l£ ' 1 1 '1 ■ \ 

Aquificales (Sulfurihydrogenibium) 


Octopus Spring 


OS 11 


80-82 


7.9 


<1 


N 


Flow channel 


Aquificales( Thermocrinis) 


Bechler Spring 


BCH 13 


80-82 


7.8 


<1 


N 


Flow channel 


Aquificales( Thermocrinis) 


Crater Hills 


CH 1 


76 


2.6 


<3 


Y 


Sediment, 
Turbulent Pool 


Crenarchaeota (Sulfolobales) 


Nymph Lake 


NL 2 


88 


~4 


<3 


Y 


Sediment, 
Turbulent Pool 


Crenarchaeota (Sulfolobales) 


Monarch Geyser 


MG 3 


78-80 


4.0 


25 


Y 


Sediment, 
Pool Edge 


Crenarchaeota (Desulfurococcales, 
Thermoproteales) 


Cistern Spring 


CIS 19 


78-80 


4.4 


20 


Y 


Sediment, 
Pool Edge 


Crenarchaeota (Desulfurococcales, 
Thermoproteales) 


Joseph's Coat 
Spr. 


JCHS_4 


80 


6.1 


20 


Y 


Sediment 
Pool Edge 


Crenarchaeota (Desulfurococcales, 
Thermoproteales) 


Washburn Spring 


WS 18 


76 


6.4 


160 


Y 


Sediment 
Pool Edge 


Crenarchaeota 
Aquificales 


100 Spring Plain 


OSP_8 


72 


3.4 


<3 


N 


Flow channel, 
Fe-oxide mat 


Crenarchaeota 
Mixed, Novel Archaea 



The three major habitat types sampled include phototrophic mats (green), filamentous "streamer" communities (blue), and archaeal-dominated, elemental sulfur-rich 
or Fe-oxide sediments (yellow and red, respectively). 

' Physical context refers to location of sample: Examples include: dissected sub-surface layer (under-mat) within a phototrophic mat; high-velocity channels containing 
filamentous "streamer" communities or down-gradient Fe-oxide mats; S-rich sediment in various geothermal pools 

2 Target phylum as discussed in submitted "Rl-Files " and based on prior molecular, culture, microscopic and/or other investigations performed by individual investiga- 
tors included in this project (Table SI in Supplementary Material). 

3 The sample from Obsidian Pool Prime (OPP_17) represents a different habitat type and was obtained from a large thermal pool after colonization of a glass plate. 
Physicochemical characteristics place this sample closest to "phototrophic mats" in the current study. 



OSP_8 (250 versus 45 |xM), but due to the consistent geot- 
hermal delivery of low levels of H2S(g) and the ubiquity of 
solid-phase elemental sulfur in CH_1, the consumption of any 
O2 is likely driven by reduced phases of sulfur rather than by 



Fe(II). Consequently, Fe-oxides are not present in the thermal 
pool of CH_1, although evidence of these phases was observed 
where the air-water interface meets rocks along the edge of the 
pool. 
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The study design captures a significant swath of high- 
temperature habitats in YNP. It is important to integrate the 
primary variables responsible for the separation of these habi- 
tat types together with analysis and interpretation of the 
metagenomes. The decision-tree (Figure 3) is also useful for iden- 
tifying habitat types not included in the study, such as acidic 
geothermal habitats less than 55-60°C, in which members of 
the Cyanidales (red algae) are commonly observed (Toplin et al., 
2008), and in which moderately thermophilic heterotrophic bac- 
teria are known to increase in abundance (Macur et al., 2004; 
Kozubal et al., 2012). The sample collected at Obsidian Pool Prime 
(OPP_17) represents a different habitat context relative to other 
sites included in the study and was obtained from 2 months growth 
on a large (625 cm 2 ) glass slide suspended in the water column at 
56°C (pH = 5.7). OPP is a large pool (~0.5 ha) receiving mixed 
water inputs, and is the only sample in the study to be collected 
from the photic zone of a large water-body, albeit heavily influ- 
enced by geothermal inputs. Although the physical context is 
considerably different than other springs and pools included in 
the study, characteristics of OPP_17 place it near the phototrophic 
sites in the decision-tree (Figure 3). 

SEQUENCING OVERVIEW AND MAJOR ASSEMBLIES 

One of the primary aims of the study was to utilize the distinct 
geochemical differences across these 20 sites as a basis for under- 
standing the distribution and function of different thermophiles. 
Moreover, given the uncertainties in predicting in situ metabo- 
lism solely using thermodynamic favorability or inference based 
on distantly related cultured isolates, it is useful to determine the 
actual organisms and genes present in high-temperature habitats 
as a means of constraining the metabolic possibilities and focus- 
ing our efforts on specific electron transfer reactions or catabolic 



pathways. The total metagenome sequence (Sanger) per site ranged 
from 20 to 60 Mb (Figure 4), but the ratio of assembled sequence to 
total sequence varied considerably across these microbial commu- 
nities. The total assembled sequence (i.e., the sum of total contig 
length) reflects the collapse of redundant sequence reads into 
contigs that correspond to the predominant populations present 
(Table 1). Consequently, for communities dominated by a single 
phylotype (e.g., MHS_10, NL_2), the total assembled sequence 
was only ~2Mbp, and there were few singlet reads remaining 
after assembly (nearly 90% of the sequences assembled into con- 
tigs in these cases). In contrast, more assembled sequence was 
obtained in more diverse communities with a greater number of 
dominant phylotypes (e.g., CP_7, BLVA, OS_ll, WS_18), but at 
lower contig coverage. Moreover, the significant number of singlet 
reads remaining after assembly in many sites suggested that the 
sequencing coverage was insufficient to allow genome assembly of 
potentially important members of these communities. 

The predominant phylotypes present in these three sample 
groups include expected microorganisms that have been the focus 
of prior studies at these exact locations (Table 1), as well as 
novel microorganisms in both the Bacteria and Archaea. Mem- 
bers of the Chloroflexi, Cyanobacteria, Chlorobi, Firmicutes, 
Bacteroidetes, Acidobacteria, and Proteobacteria dominated the 
phototrophic mats (Figure 5). The only sulfidic phototrophic site 
(BLVA) included in the study was sampled twice (~8 months 
apart), before and during a bloom of purple sulfur bacteria 
(Gamma-proteobacteria), and both samples contained a strong 
signature of green Chloroflexi. Each of the in-channel "streamer" 
communities were dominated by one of three major Aquificales 
lineages specialized to variations in pH and sulfide (Figure 5). 
However, there was considerable variation in the other organ- 
isms present across the six Aquificales communities. The low-pH 




FIGURE 4 | Total amount of Sanger sequence (Mbpair) obtained for each 
site (top of open bars). The sum of assembled sequences (e.g., sum of all 
contig lengths) as well as that for all singleton sequences is shown as dark 



and light stacked bars, respectively [e.g., 62 Mbp of total sequence was 
obtained for CP_7 which resulted in 19 Mbp of assembled sequence (total 
contig length) and ~14 Mbp of singleton sequences]. 
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FIGURE 5 | Principal components analysis of oligonucleotide word 
frequencies (i.e., k-mers) of assembled metagenome sequence data for 
each group of sites. (A) Phototrophic, (B) "Streamer," and (C) Archaeal. 
Left Column: Assembled sequence data colored by site [Phototrophic Mats: 
BLVA_5, 20 = light-purple, dark-purple; WC_6= light-blue; CP_7 = brown; 
MS_15 and FG_16 did not contain a significant number of 
contigs > 10,000 kb. "Streamer" Communities: DS_9 = yellow; 



OSP_14=red; MHS_10 = green; CS_12 = violet; OS_11 =dark-blue; 
BCH_13 = light-blue. Archaeal Systems: CH_1 =gold; NL_2 = yellow; 
MG_3 = green; CIS_19 = light-blue; JCHS_4 = dark-blue; OSP_8 = red; 
OSP_14 = pink). Sites OPP_17 and WSJ 8 not included. Right Column: 
Phylogenetic assignment of sequence clusters (identical PCA orientation) 
based on comparative sequence analysis of all contigs to reference 
genomes (APIS, Badger et al., 2006). 



Aquificales communities (DS_9, OSP_14) contained significant 
sub-populations of different archaea (separated by the amount 
of sulfide), as did sites with higher pH values (OS_ll, CS_12, 
BCH_13). In contrast, the MHS_10 sample contained no archaeal 
sequence and was dominated by one genus within the Aquificales 
(Sulfurihydrogenibium-\ike) . 

The high-temperature sulfur sediments and Fe-oxide mat sam- 
ples were dominated by archaeal sequence reads (i.e., >85%) 
and contributed an extensive diversity of archaeal protein fam- 
ilies (TIGRFAMS) not currently represented in environmental 
metagenome data, with the exception of replicate sites included in 
a smaller study of YNP chemotrophic communities (Inskeep et al, 
2010). The sites dominated by Sulfolobales (CH_1, NL_2) repre- 
sented a low-pH extreme and these communities are compared 



in more detail to other sulfur sediments at moderate, but slightly 
acidic pH values (pH 4-6), which are dominated by members 
of the orders Desulfurococcales and Thermoproteales (sites 
MG_3, JCHS_4, CIS_19) (Figure 5). In contrast, the acidic Fe- 
mats (OSP_8, 14) contained Fe(II) -oxidizing Sulfolobales (Met- 
allosphaera sp.) and novel archaeal populations observed only 
in the absence of significant levels of dissolved sulfide (Table 1; 
Figure 5). 

RELATIVE DISTRIBUTION OF PROTEIN FAMILIES 

To provide an independent assessment of the three major site 
groups, we analyzed broad functional differences among the 20 
sites using relative TIGRFAM protein family abundances present 
in the metagenomes. TIGRFAM protein families represent a 
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relatively complete set of manually curated prokaryotic protein 
family models, which makes them suitable for the type of compar- 
ison we sought to perform. Abundance data, which were weighted 
by the scaffold assembly coverage, were obtained by counting 
protein copy numbers in assembled scaffolds for each site. Prin- 
cipal Component Analysis (PCA) of the TIGRFAM abundance 
data (Figure 6) showed that the first two principal components 
accounted for 80.1% of the functional variation among sites, and 
a two component plot (PCI versus PC2) clearly separates the three 
major types of sites included in the study (phototrophic mats, 
"filamentous streamer communities," and archaeal-dominated S 
or Fe sediments). In particular, Component 2 appeared to mea- 
sure the relative fraction of bacterial versus archaeal sequence in 
the metagenome. For example, although sites DS_9 and OSP_14 
contain a dominant Aquificales population, these sites were more 
similar to the archaeal group because of the presence of sub- 
dominant (<30%) archaeal populations (albeit different archaeal 
phylotypes in each site). Although site OS_ll also contained a 
major Aquificales population, the community contained signif- 
icant sub-populations of novel bacterial groups that were also 
found in phototrophic mats (e.g., Bacteroidetes, Firmicutes), and 
thus fell closer to the phototrophic group. Site WS_1 8 contained at 



least three major archaeal populations and was included with the 
archaeal group, but this sediment sample also contained at least 
two major bacterial populations that plotted away from the main 
archaeal cluster in TIGRFAM PCA space (Figure 6). 

Comparing groups, the Aquificales "streamer" communities 
and archaeal-rich habitats yielded broader functional diversity 
than phototrophic sites because the chemotrophic sites included 
both bacteria and archaea while the phototrophic sites were dom- 
inated by bacteria. The phototrophic mats contained very few 
sequences attributable to archaea or members of the Aquificales. 
Component 1 reflects the amount of Aquificales versus pho- 
totrophs (e.g., important for bacterial sites) as well as the amount 
of sequence data attributable to members of the Sulfolobales 
(especially relevant for archaeal sites), in which the content of 
Sulfolobales decreased from left to right (see PCI, Figure 6). Com- 
ponent 3 separates two extreme sites (CH_1 and NL_2) that were 
dominated by Sulfolobales (Figure 5) from the majority of other 
sites, although PC3 explains only 5.9% of the variation across sites. 

To identify what functional categories underlie the differences 
among sites, we obtained average TIGRFAM (Selengut et al, 2007) 
abundances for each functional category and site, and clustered 
these values using two-way hierarchical clustering (Figure 7). 
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FIGURE 6 | Principal components analysis of normalized TIGRFAM 
protein family abundance data across all 20 sites. The three panels show 
pairwise plots of the first three principal components (PC1 and PC2 account 



for 80% of the variation across sites, while PC3 only represents ~6%). The 
first and second components separate sites into the three main habitat types 
studied (site labels and color groups are described n Table 1). 
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FIGURE 7 | Two-way hierarchical clustering of normalized 
TIGRFAM protein family abundance data averaged across 
intermediate-level TIGRFAM functional categories The data was 
standardized (subtract mean and divide by standard deviation) across 

The site clustering tree based on TIGRFAM abundances showed 
remarkable similarity to the decision-tree (Figure 3), which indi- 
cated that the environmental variables separating the major habitat 
types (especially pH, temperature, and reduced sulfur versus oxy- 
gen) are key determinants of microbial community structure and 



sites before clustering so that the color scale units are in standard 
deviations from the mean across sites. Yellow colors correspond to 
values that are higher than the site mean and blue colors to values 
that are lower than the mean. 



function. The TIGRFAM categories that varied most significantly 
among sites included a broad array of proteins important in cen- 
tral metabolism, cell replication, motility, photosynthesis, electron 
transport, and other metabolic functions (Figure 7). The observed 
site separation within a TIGRFAM category could result from 



www.frontiersin.org 



May 2013 | Volume 4 | Article 67 | 9 



Inskeep et al. 



TheYNP metagenome project 



several factors that would not be evident without further investi- 
gation. These include the possibility that house-keeping functions 
simply reflect distinct phylogenetic differences observed across 
sites, and/or alternatively, that the range in abundance within 
a category may reflect the presence versus absence of specific 
metabolic capabilities contributed by phyla unique to a site. Con- 
sequently, the separation of sites using broad functional categories 
(Figure 7) accounts for both phylogenetic and functional differ- 
ences, and provides strong support for the major site groupings 
emphasized in the three accompanying articles (Inskeep et al., 
2013a,b; Takacs-Vesbach et al., 2013). 

Specific TIGRFAM categories that varied in abundance across 
site groups included differences in functional categories such as 
nitrogen fixation and photosynthesis, expected to be highly rep- 
resented in phototrophic sites (Figure 7). In contrast, TIGRFAMS 
that included cofactor biosynthesis (folic acid and lipoate), sur- 
face structures, fatty-acid biosynthesis, and chemotaxis/motility 
were more abundant in "streamer" communities. The archaeal sites 
exhibited a greater abundance of TIGRFAMs that included RNA 
processing, amino acid biosynthesis, nitrogen metabolism, aerobic 
respiration, and detoxification (Figure 7). In some cases, differ- 
ences in TIGRFAM abundance across sites may have resulted from 
phylogenetic differences; for example, archaeal genes encoding 
a specific function may not be recognized as part of a TIGR- 
FAM category that has been established primarily from bacterial 
genomes. TIGRFAM categories that result in the greatest sepa- 
ration among the 20 sites included signal transduction, nitrogen 
fixation and regulatory functions, which represent a greater pro- 
portion of total sequences in phototrophic sites versus either 
Aquificales or archaeal communities (Figure 7; Figure S2 in Sup- 
plementary Material shows two of these TIGRFAM categories in 
greater detail). 

More fine-grained functional differences among sites were 
obtained by only considering TIGRFAMs in a specific category 
such as "Electron Transport" (Figure 8). The site clustering based 
on the relative abundance of different electron transport domains 
provided a different view of the variation in functional attrib- 
utes across sites. The distribution of specific respiratory com- 
plexes across sites [e.g., heme Cu oxidases (HCO), cytochrome 
foci-ubiquinol oxidases, blue Cu proteins, NiFe-hydrogenases, and 
nitrite/nitrate reductases] correlates more closely with geochemi- 
cal parameters (e.g., oxygen, sulfur, hydrogen). Consequently, the 
distribution of different electron transport proteins across sites 
did not result in identical site clustering obtained using broader 
TIGRFAM categories (Figure 7). These observations are consis- 
tent with the fact that each group (i.e., phototrophic, archaeal, 
streamer communities) contained a range of sites with variable 
levels of sulfide or oxygen. Consequently, the factors contributing 
to functional variation within each of the three main site groups 
require additional dissection to appreciate how changes in these 
attributes are correlated with specific phylotypes (Inskeep et al., 
2013a,b; Takacs-Vesbach et al, 2013). 

The distribution and breadth of protein families observed in 
the current study represents a significant contribution to total 
protein diversity observed in all metagenomes currently found in 
the Integrated Microbial Genomes and Metagenomes (IMG/M) 
database (Markowitz et al, 2012) (as of spring 2012). This is 



due primarily to the fact that the current study focused on high- 
temperature communities rich in Aquificales, other deeply rooted 
novel bacteria, and numerous lineages of archaea. The functional 
diversity of the 20 Yellowstone geothermal samples included in 
this study was compared to other types of microbiomes as repre- 
sented by the metagenomic sequence sets available in IMG/M, and 
based upon the TIGRFAM protein family abundance profiles for 
the assembled metagenomes. YNP samples are clearly separated 
from soil, water, gut, and other types of microbial communities 
in the public database (Figure S3 in Supplementary Material). As 
might be expected, the archaeal-dominated sites are most distant 
from the majority of public metagenomes, as few metagenome 
studies have targeted communities with significant archaeal pop- 
ulations. For example, the Crater Hills (CH_1) and Nymph Lake 
(NL_2) sites contained 1-2 predominant Sulfolobales populations 
and represented one extreme in the TIGRFAM PCA plots. More- 
over, nearly identical samples obtained ~2 years prior to this study 
(Inskeep et al, 2010) from four of the sites discussed here (CH_1, 
JCHS_4, CS_12, MHS_10) grouped with their expected replicate 
using PCA of TIGRFAM categories. Phototrophic mats and Aquif- 
icales "streamer" communities, which are dominated by eubacte- 
ria, fall closer to publicly available environmental metagenomes. 
The first principal component (which accounted for 24.8% of total 
variability in the TIGRFAM protein family abundance data across 
all sites in the database) almost entirely represents variation that 
exists within the YNP samples alone. The dramatic segregation of 
TIGRFAMS from YNP metagenomes versus previously sequenced 
mesophilic community metagenomes is indicative of the extensive 
contributions attributable to previously undescribed functional 
diversity. 

SUMMARY 

The metagenomes and corresponding metadata discussed here 
and in the accompanying articles provides a significant foundation 
of phylogenetic and metabolic information relevant to a wide 
range of geothermal ecosystems in YNP. Although the total 
sequence data obtained on an individual site basis was not suf- 
ficient to achieve assembly of all major phylotypes present in 
these distinct habitats, this study makes a considerable step toward 
understanding how microbial community structure and metabolic 
potential vary across a wide range of environmental parame- 
ters. A comparative distribution of protein families identified 
in the metagenome sequence reflects the major differences in 
phylogenetic structure among three primary groups of sites stud- 
ied: phototrophic mats, Aquificales "streamer"-communities, and 
archaeal-dominated sediments. The systematic selection of geo- 
chemically distinct sites provides an enormous opportunity to link 
individual phylotypes (and associated metabolic attributes) with 
specific physicochemical properties of the habitats they occupy. 

MATERIALS AND METHODS 
GE0CHEMICAL ANALYSIS 

Twenty geothermal sites in Yellowstone National Park (Table 1) 
were sampled and characterized in 2007-2008 (Figure SI in Sup- 
plementary Material). Parallel samples of the bulk aqueous phase 
(<0.2 (im) and sediment intimately associated with the microbial 
community were obtained simultaneously and analyzed using a 
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FIGURE 8 | Hierarchical clustering of metagenomes using relative TIGRFAM family abundances of genes in theTIGRFAM category "Electron 
Transport". Note that the abundance data has not been standardized, but rather log2-transformed (absolute abundances are shown). 



combination of field and laboratory methods. Temperature, pH, 
and redox-sensitive species (Fe 2+ /Fe 3+ ; total dissolved sulfide; 
dissolved O2) were determined using field methods (with some 
exceptions given the number and diversity of sites studied) as 
described in more detail in previous reports (Langner et al, 
2001; Macur et al., 2004). Total dissolved ions were determined 
using inductively coupled plasma (ICP) spectrometry and ion 



chromatography (for all major cations, anions and trace elements). 
Dissolved gases (CO2, H2, CH4) were determined using closed 
headspace gas chromatography (Inskeep et al., 2005) of sealed 
serum bottles ( <0.2 (im) obtained in the field without air contact. 
A complete dataset of geochemical information corresponding to 
these samples is provided (Table S2 in Supplementary Material). 
Selected sediment and microbial mat samples were analyzed using 
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scanning electron microscopy (Phillips Field Emission-SEM) in 
combination with energy-dispersive analysis of x-rays (EDAX) 
as well as x-ray diffraction (XRD). Thermodynamic calculations 
adjusted for temperature effects and performed using site-specific 
activities of dissolved and solid-phase constituents (Amend and 
Shock, 2001) showed that numerous oxidation-reduction reac- 
tions are exergonic (i.e., energy-yielding) in these types of geot- 
hermal systems and could support chemolitho- or chemoorgano- 
trophic metabolisms (Amend et al., 2003; Inskeep et al., 2005; 
Shock et al., 2005). The oxidation of energy-rich, reduced con- 
stituents such as H2, CH4, H2S, S°, and As(III) is extremely 
favorable when geothermal waters are exposed to atmospheric O2, 
or in some cases using alternate electron acceptors such as nitrate, 
ferric Fe, sulfate, or elemental S°. Dissolved Fe(II) and ammonium 
(NH4) concentrations can be very high in certain YNP geothermal 
systems, and offer yet another set of exergonic reactions that serve 
as a potential geochemical niche for chemotrophic organisms. 

DNA EXTRACTION AND LIBRARY CONSTRUCTION 

A standard DNA extraction protocol was used for the majority of 
samples, however, several sulfur sediments were ultimately sub- 
jected to various extraction kits to generate sufficient DNA yields 
for library construction. Our main emphasis was to obtain repre- 
sentative, unbiased environmental DNA for construction of small- 
insert libraries. Briefly, 3-25 g wet samples were extracted with 

1 ml of Buffer A (200 mM Tris, pH 8; 50 mM EDTA; 200 mM NaCl; 

2 mM sodium citrate; 10 mM CaCh ) with lysozyme ( 1 mg/ml final 
concentration) for 1.5 h at 37°C. Proteinase K (final concentra- 
tion 1 mg/ml) and SDS [final concentration 0.3% (w/v)] were 
then added and incubated for 0.5 h at 37°C. This first lysate was 
removed and the samples were re-extracted using bead-beating 
protocols. The two lysates were combined and extracted with 
phenol-chloroform, and the resulting DNA was re-precipitated in 
ethanol, treated with RNAase and quantified by gel electrophoresis 
and staining. Our intent was to avoid biasing the samples against 
organisms that may be difficult to lyse with chemical methods. 
The extraction kits (e.g., MoBio) used on several of the sulfur- 
containing sediment samples also included a modest physical lysis 
step. Across all 20 sites, the extracted DNA ranged in size from 
~4 to 12kbp, and for many samples these procedures resulted in 
higher MW DNA ranging from 9 to 20 kbp. 

RANDOM SHOTGUN SEQUENCING 

Short-insert (~3kbp) pUC18 libraries were constructed from all 
sites, and preliminary sequencing of 5-10 megabases (Sanger) , was 
performed as a quality control to determine if the initial sequenc- 
ing results [via MEGAN (Huson et al, 2007) and blastx analysis] 
were consistent with the sample origin. All samples passed this 
quality checkpoint and were further sequenced to produce a total 
of 40-60 Mbp per site (Figure 4), with the exception of two 
sites. Both MHS_10 and NL_2 contained only one major pop- 
ulation type that was covered sufficiently with ~20 Mbp of Sanger 
sequence and these sites were also included for a half plate of 454 
titanium pyro-sequencing (~225Mbp per sample), along with 
two additional samples (MG_3, JCHS_4). When combined with 
the total amount of Sanger sequencing performed (871 Mbp), this 
collectively represented nearly 2 Gbp of random shotgun sequence. 



16S rRNA genes were also amplified from the DNA of each site 
using universal primers specific for bacteria and archaea, and one 
384-well plate was sequenced from each successful library. Results 
from 16S rRNA gene sequencing were generally consistent with 
the phylogenetic signatures observed in the metagenome data, 
with the exception of one site (NL_2), presumably due to PCR 
amplification bias. For NL_2, the majority of 16S rRNA clones 
were Thermoproteales-like, but the random shotgun sequence 
data showed a predominant Sulfolobales-like population. 

SEQUENCE ASSEMBLY 

Metagenome assembly was conducted using two approaches 
(Celera and PGA), which resulted in reasonably similar over- 
all assembly statistics. Due to the slightly better scaffold con- 
struction and prior history using the Celera assembler (Rusch 
et al., 2007; Inskeep et al., 2010), these assemblies were sub- 
jected to more detailed phylogenetic and functional analyses. 
However, the presence of genes or protein families described 
here are found in both assemblies, and the major results dis- 
cussed here are not significantly different using either set of 
assembled data. Automated tools in IMG were used for gene 
identification and annotation for both sets of assembled data, 
and both sets of data are available on IMG/M. The following 
parameters were used for the Celera Assembler (Version 4.0): 
doOverlap Trimming = 0, doFragmentCorrection = 0, globalEr- 
rorRate=12, utgErrorRate = 150, utgBubblePopping = 1, and 
useBogUnitig = 0. For PGA assemblies (Zhao et al, 2008), the 
following parameters were employed: OverlapLen = 30; Per- 
cent =0.75; Clearance = 30; Clipldn = 77; ClipQual=10; Cut- 
offScore = 400; EndOverhang = 800; InOverhang = 500; Min- 
CovRep = 50; MinLinks = 2; MinSat = 3; Numlter = 50; Penal- 
izeN=l; QualOverLim = 400; QualScoreCutoff = 200; Qual- 
SumLim = 3500; SimDiFac = 30; Verbosity = 1. 

As mentioned above, four sites (NL_2, MG_3, JCHS_4, 
MHS_10) received both Sanger sequence and one-half plate of 
454 titanium pyro-sequencing (also available on IMG/M). Due to 
the success of Sanger sequencing in generating significant assem- 
blies from these four sites, the additional pyro-sequence is not 
discussed in great detail in the subsequent manuscripts. In all 
four of these sites, contig coverage increased significantly when 
the pyro-sequence reads were included. However, the analyses dis- 
cussed here are mostly derived from the Sanger sequence data, in 
order to maintain a consistent analytical approach for the data 
from all 20 sites. 

PHYLOGENETIC ANALYSIS OF ENVIRONMENTAL SEQUENCE DATA 

Phylogenetic analysis of random shotgun sequence reads was per- 
formed using a number of different approaches, and these yielded 
convergent information regarding the predominant phylotypes 
present in these geothermal sites. Individual sequence reads were 
analyzed using G + C content (mol%) distribution coupled with 
blastx and MEGAN (Huson et al, 2007) analysis. Genome-level 
phylogenetic analysis was accomplished using fragment recruit- 
ment of environmental sequence data to reference microbial 
genomes (Rusch et al, 2007). At the time of writing, the database 
contained reference microbial genomes for ~1500 bacteria and 
100 archaea, however, only a handful of microbial genomes 
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currently serve as appropriate references for the indigenous 
organisms within these communities. Assembled metagenome 
sequences were also analyzed using three dimensional PCA plots of 
nucleotide word frequencies (Teeling et al., 2004) with a simultane- 
ous phylogenetic classification based on APIS or on a blast-based 
classification (Badger et al., 2006; Rusch et al, 2007). Metagenome 
sequence reported here can be viewed with these utilities at 
http://gos.jcvi.org/openAccess/scatterPlotViewer.html. 

HIGH-LEVEL TIGRFAM ANALYSIS OF ENVIRONMENTAL SEQUENCE 
DATA 

Assembled scaffolds were annotated as described in Inskeep et al. 
(2010) and predicted proteins from the scaffolds were assigned to 
TIGRFAM protein families (Selengut et al, 2007) using HMMER 
3 (Eddy, 2011) with e-value cutoff of le-6. TIGRFAM family 
counts for each scaffold were multiplied by the average cover- 
age for the scaffold and the coverage weighted counts across all 
scaffolds for a particular site were summed to obtain estimated 
total family counts for a site. Different sites were normalized so 
that the total count over all TIGRFAM protein families was kept 
constant. TIGRFAM families were categorized using a two-level 
functional classification described on the TIGRFAM website 2 . 
PCA and statistical analysis of site group differences was per- 
formed using the STAMP v2.0 software (Parks and Beiko, 2010). 
The AN OVA test and Benjamini-Hochberg FDR correction imple- 
mented in STAMP was used to test for differences between multiple 
site groups. Two-way clustering was done on row-standardized 
(across sites) average TIGRFAM category abundance data using 
the Euclidean distance metric and complete-linkage hierarchical 
clustering. The MeV 4.8 (Saeed et al., 2003) software package 
was used for clustering and visualization. For comparison with 
other metagenomes in the IMG system, TIGRFAM profiles were 
generated for all the sites in this study and other metagenomic 
data sets downloaded using the IMG web interface. 

DETAILED FUNCTIONAL ANALYSIS OF ENVIRONMENTAL 
METAGENOMIC SEQUENCE DATA 

The assembled metagenome sequence data was also screened for 
specific functional genes corresponding to known or putative 
pathways in material and energy transfer. We were specifically 
interested in assessing metabolic potential for chemolithoautotro- 
phy (CO2 fixation and electron transfer) in high-temperature 
geothermal systems. Query DNA sequences known to code for 
proteins important in the oxidation of reduced chemical con- 
stituents or the reduction of a terminal acceptor (Table S3 in 
Supplementary Material) were used to search the environmen- 
tal sequence data. Environmental sequence fragments exhibiting 
sequence similarity (e-values < 10~ 10 ) to query sequences were 
then reanalyzed using blastp, and assessed individually using phy- 
logenetic analysis of deduced protein sequences against known 
relatives, as well as fragment length relative to query length. 
False positives were minimized using this screening process. This 
included (i) sequences matching the correct protein family of the 
query sequence, but not the exact query sequence (e.g., Mo-pterin 
oxidoreductases versus a specific protein within this family); (ii) 



2 http://www.jcvi.org/cgi-bin/tigrfams/index.cgi 



sequences matching a query sequence due to regions of sequence 
similarity, but were clearly associated with a gene or gene cluster 
with different function; and (iii) sequences that returned mis- 
annotated blastp relatives. It is also possible that our inventory of 
metabolic potential missed sequences related to a specific query 
gene. For example, some genes found in the metagenome data were 
of insufficient length relative to a specific query sequence ( <40%) 
to make a definitive assignment. Moreover, the lower depth of cov- 
erage (<lx) of sub-dominant phylotypes precluded a complete 
functional analysis of these organisms. 

SEQUENCE AVAILABILITY 

All annotated metagenome sequence assemblies (Celera/PGA) 
are available through the DOE-JGI IMG/M website 3 under 
IMG taxon OID numbers as follows (site order is identical 
to that presented in Table 1): Phototroph Sites [YNPSite06 
(2022920004/2013515000), Site07 (2022920013/2014031006), 
Sitel5 (2022920016/2015219002), Sitel6 (2022920018/2016842 
003), Site05 (2022920003/2013954000), Site20 (2022920020/2016 
842008) , and Sitel 7 (202292002 1/20 1 6842005) . ] ; Aquificales Sites 
[YNPSite09 (2022920010/2014031004), Sitel4 (2022920007/2013 
954001), SitelO (2022920015/2015391001), Sitel2 (2022920011/20 
1403 1005), Sitel 1 (2022920012/2014031007), andSitel3 (2022920 
006/2013515002).]; Archaeal Sites [YNPSiteOl (2022920009/2014 
031002), Site02 (2022920014/2015219001, 2016842002), Site03 
(2022920002/2014031003,2016842001), Sitel9 (2022920017/2015 
219000), Site04 (2022920008/2013843003), Sitel8 (2022920019/20 
16842004), and Site08 (2022920005/2013515001).]. 

ACKNOWLEDGMENTS 

Authors appreciate support from the National Science Founda- 
tion Research Coordination Network Program (MCB 0342269), 
the DOE-Joint Genome Institute Community Sequencing Pro- 
gram (CSP 787081) as well as all individual author institutions 
and associated research support that together has made this study 
possible. The work conducted by the U.S. Department of Energy 
Joint Genome Institute is supported by the Office of Science of 
the U.S. Department of Energy under Contract No. DE-AC02- 
05CH11231. Authors appreciate the research permit focused on 
the YNP metagenome project (Permit No, YELL-5568, 2007- 
2010), and managed by C. Hendrix and S. Guenther (Center for 
Resources, YNP). 

SUPPLEMENTARY MATERIAL 

The Supplementary Material for this article can be found 
online at http://www.frontiersin.org/Microbial_Physiology_and_ 
Metabolism/ 1 0.3389/fmicb.20 13.00067/abstract 

Table S1 | Contributions of NSF Research Coordination Network Steering 
Committee and Working Group members to the Yellowstone Metagenome 
Community Sequencing Project (DOE JGI CSP 787081). 

Table S2 | Geochemical parameters measured in the bulk aqueous 
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potential metabolic attributes associated with the predominant 
phylotypes found within these geothermal sites. 



Figure SI | Additional site photographs emphasizing landscape context of 
geothermal habitats and field sampling efforts (included as a separate file 
containing 53 annotated photographs). 



Figure S2 |Top ranked TIGRFAM categories based on their ability to 
differentiate the three types of sites discussed in this study: aquificales 
"streamer" communities (blue), archaeal-dominated sediments (yellow). 



and phototrophic mats (green). (A) Signal transduction; (B) Regulatory 
functions (ANOVA test with Benjamini-Hochberg FDR multiple testing 
correction). 

Figure S3 | Principal components analysis of normalized TIGRFAM protein 
family abundance data across all 20 sites as well as a subset of diverse 
metagenomic datasets from IMG. The metagenomes are colored based on 
rough classification of the environments from which the sample originates: 
Yellowstone sites in this study = red; Yellowstone sites from other 
studies = orange; sediments^ turquoise; so/7s=pink; aquatic = light-blue; 
fungus-garden = green; animal-gut = brown; enrichment 
community = light-orange; other samples = gray. 
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